In this analysis, I have adapted a methodology originally used to predict the occurrence of crime to predict 311 calls for street trees. It is likely that selection bias plays a similar role in this dynamic as it does in predictive policing, for a number of reasons. These include the fact that some philly residents are unaware of or unable to access the 311 service due to technological or language barriers, whereas others might be ‘power users’ of the service who might over-use the service by submitting multiple requests for a single event. Additionally, residents in wealthier neighborhoods might be more concerned about the aesthetic appeal of street trees, while a resident in a less affluent area might be more concerned about basic services. Finally, factors related to the trees themselves, such as size or presence of infrastructure features like sewer inlets or aboveground wires can lead to higher tree mortality rates or higher rates of trees coming into conflict with infrastructure.These factors and others could lead to a disproportionate number of requests from certain areas or demographics, causing those areas of the city to be under- or over-represented in the data, independent of the actual distribution of street tree work needed.
The purpose of adapting this predictive policing algorithm to predict Street Tree 311 calls is to facilitate preventative care and avoid property damage from fallen trees. It must be said that this theoretical care is highly aspirational, as there are currently thousands of tree work calls waiting to be resolved. However, if the city were able to clear the backlog of tree maintenance requests, this tool could be used to focus preventative tree care efforts in areas most likely to experience 311 calls about street trees. After all, as the famous Ben Franklin quote goes, ‘an ounce of prevention is worth a pound of cure.’
Clear Environment
Load Libraries
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt) # for KDE and ML risk class intervals
library(tigris)
library(arcpullr)
library(units)
library(zip)
library(rgdal)
library(osmdata)
Data Wrangling
# # 311 Requests API (needs work)
# # StreetTreeRequests_ALL <- get_spatial_layer("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/philly311__public_cases/FeatureServer/0/") %>%
# # filter(service_name == 'Street Trees') %>%
# # filter(lat != "") %>%
# # st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
# # st_transform(crs = coordinate_system)
#
# # 311 Requests (by year)
# # StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2022.csv") %>%
# # StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2021.csv") %>%
# StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2019.csv") %>%
# # StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2018.csv") %>%
# filter(service_name == 'Street Trees') %>%
# filter(lat != "") %>%
# st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
# st_transform(crs = coordinate_system)
#
# # Specify the file path where you want to save the GeoJSON file
# # file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2022.geojson"
# # file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2021.geojson"
# file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2019.geojson"
# # file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2018.geojson"
#
# # Write the data frame as GeoJSON
# st_write(StreetTreeRequests, file_path)
#
# LOAD +WRITE TREE INVENTORY
# trees <- get_spatial_layer("https://services.arcgis.com/fLeGjb7u4uXqeF9q/arcgis/rest/services/PPR_Tree_Inventory_2022/FeatureServer/0/") %>%
# filter(TREE_DBH > 0, TREE_DBH < 100) %>%
# st_transform(crs = coordinate_system) %>%
# st_join(philly_boundary) %>%
# mutate(Legend = "PhillyTrees") %>%
# mutate(Legend2 = "DBH")
#
# # Write the data frame as GeoJSON
# file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/PPR_Tree_Inventory_OA.geojson"
# st_write(trees, file_path)
#
# # Compress the file using gzip
# system(paste("gzip", file_path))
Set Parameters
# Coordinate System
coordinate_system <- 2272
# # Resolution
resolution <- 600
palette_sequential <- c("#AF2E1B", "#CC6324", "#BFA07A", "#E2CEBA", "white")
palette_diverging <- c("#AF2E1B", "white", "#3B4B59")
# # #Load Boundary File
# query <- opq("Philadelphia, PA, USA") %>%
# add_osm_feature("boundary", "administrative")
#
# philly_boundary <- osmdata_sf(query)
# philly_boundary <- philly_boundary$osm_polygons
# philly_boundary <- philly_boundary %>% st_transform(crs = coordinate_system)
#
# #Load Boundary File
# philly_boundary <- counties(state = 42) %>%
# filter(NAME == "Philadelphia")
#PHILLYBOUNDARY
philly_boundary <- st_read("https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/City_Limits.geojson") %>%
st_transform(crs = 2272)
Load data for independent variables
#TREE INVENTORY
# Load and Unzip tree inventory
file_path_gz <- "https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/PPR_Tree_Inventory_OA.geojson.gz" # Unzipping and reading the content
con <- gzcon(file(file_path_gz, "rb"))
geojson_content <- readLines(con)
close(con)
temp_file <- tempfile(fileext = ".geojson")
writeLines(geojson_content, temp_file) # Write uncompressed content to a temporary file
trees <- st_read(temp_file, quiet = TRUE) %>% # Read the GeoJSON from the temporary file
st_transform(crs = 2272)
unlink(temp_file) # Delete the temporary file after reading it
# ACS DATA
#2018 for Philly
tracts2018 <-
get_acs(geography = "tract",
variables = c("B25026_001E",
"B19013_001E",
"DP04_0046PE"),
year=2018, state=42, county="101",
geometry=TRUE, output="wide") %>%
rename(TotalPop = B25026_001E,
MedHHInc = B19013_001E,
HomeOwnRate = DP04_0046PE) %>%
mutate(area = st_area(geometry)) %>%
mutate(year = "2018",
PopDensity = drop_units(TotalPop / (area / 2.788e+7))) %>%
mutate(HomeOwnRate = ifelse(is.na(HomeOwnRate), 0, HomeOwnRate)) %>%
mutate(MedHHInc = ifelse(is.na(MedHHInc), 0, MedHHInc)) %>%
dplyr::select(-NAME, -TotalPop, -area, -starts_with("D"), -starts_with("B"))%>%
st_transform(crs = 2272)
# NEIGHBORHOODS
neighborhoods <- get_spatial_layer("https://services1.arcgis.com/a6oRSxEw6eIY5Zfb/ArcGIS/rest/services/Philadelphia_Neighborhoods/FeatureServer/0")%>%
st_transform(crs = 2272)
# Points
#VACANT LOTS
gdb_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Midterm/data/Features4PPA.gdb"
layers <- ogrListLayers(gdb_path)
Vacant_Lots <- sf::st_read(gdb_path, layer = "VacantLots")%>%
st_transform(crs = 2272) %>%
mutate(Legend = 'VacantLots') %>%
dplyr::select(Legend) %>%
st_centroid(Vacant_Lots)
#SEWER INLETS
SewerInlets <- st_read("/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/PhillyWater_INLETS/PhillyWater_INLETS.shp") %>%
st_transform(crs = 2272) %>%
dplyr::filter(SYMBOLGROU == "Inlet") %>%
mutate(Legend = 'SewerInlets') %>%
dplyr::select(Legend)
# Lines
#HISTORIC STREAMS
HistoricStreams <- st_read("/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/HistoricStreams_Arc-shp/0e0b38fd-a993-4007-88a3-0373a0035f77202041-1-vo11h2.z0onl.shp")%>%
st_transform(crs = 2272) %>%
mutate(Legend = 'HistoricStreams') %>%
dplyr::select(Legend)
#CURBLINES
curblines <- st_read("/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/curblines/curblines.shp")%>%
st_transform(crs = 2272) %>%
mutate(Legend = 'curblines') %>%
dplyr::select(Legend)
Load in Selected 311 Data
# Load Street Tree 311 Requests (previously cleaned, see commented code above)
StreetTreeRequests <- st_read("https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/StreetTreeRequests2018.geojson")%>%
st_transform(crs = 2272)
# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = philly_boundary) +
geom_sf(data = StreetTreeRequests, colour="orange", size=0.1, show.legend = "point") +
labs(title= "Tree 311 Calls, Philadelphia - 2022") +
mapTheme(title_size = 10),
ggplot() +
geom_sf(data = philly_boundary, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(StreetTreeRequests)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Tree 311 Calls") +
mapTheme(title_size = 10) + theme(legend.position = "none"))
These maps show how street tree 311 calls are distributed across
the city. It would appear that there are some hotspots.
## using {sf} to create the grid
## Note the `.[philly_boundary] %>% ` line. This is needed to clip the grid to our data
fishnet <-
st_make_grid(philly_boundary,
cellsize = 1000,
square = TRUE) %>%
# hexagon = TRUE) %>% #option for a hexagonal grid
.[philly_boundary] %>% # fast way to select intersecting polygons
st_sf() %>%
mutate(uniqueID = 1:n())
## add a value of 1 to each tree request, sum them with aggregate
tree_net <-
dplyr::select(StreetTreeRequests) %>%
mutate(countRequests = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countRequests = replace_na(countRequests, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = tree_net, aes(fill = countRequests), color = NA) +
scale_fill_viridis() +
labs(title = "Count of Street Tree Service Requests for the Fishnet") +
mapTheme()
Joining the count of street tree 311 requests to the fishnet highlights
these hotspots, where single cells are bright with a very high number of
requests! Could this be indicative of actual concentrations of tree work
needing to be done, or is it rather just a small handful of particularly
persistent citizens?
Nearest Neighbor Features
## NN from trees
vars_net <- fishnet %>%
mutate(trees.nn = nn_function(st_coordinates(st_centroid(fishnet)),
st_coordinates(trees),
k = 5))
## Join NN feature to our fishnet
vars_net <- left_join(tree_net, st_drop_geometry(vars_net), by="uniqueID")
# ## NN from vacant lots
# vars_net <- vars_net %>%
# mutate(Vacant_Lots.nn = nn_function(st_coordinates(st_centroid(fishnet)),
# st_coordinates(Vacant_Lots),
# k = 5))
# ## Join NN feature to our fishnet
# vars_net <- left_join(tree_net, st_drop_geometry(vars_net), by="uniqueID")
Joining Spatial Features
vars_net <- trees %>%
st_join(vars_net, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n(),
sum_TREE_DBH = sum(TREE_DBH, na.rm = TRUE), # Calculate sum of TREE_DBH
mean_TREE_DBH = mean(TREE_DBH, na.rm = TRUE)) %>% # Calculate sum of TREE_DBH
left_join(vars_net, ., by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
dplyr::select(-`<NA>`) %>%
ungroup() %>%
mutate(sum_TREE_DBH = ifelse(is.na(sum_TREE_DBH), 0, sum_TREE_DBH)) %>%
mutate(mean_TREE_DBH = ifelse(is.na(mean_TREE_DBH), 0, mean_TREE_DBH)) # Replace NA values with 0
vars_net <- SewerInlets %>%
st_join(vars_net, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>% # Calculate count of trees
left_join(vars_net, ., by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
dplyr::select(-`<NA>`) %>%
ungroup()
vars_net <- Vacant_Lots %>%
st_join(vars_net, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>% # Calculate count of trees
left_join(vars_net, ., by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
dplyr::select(-`<NA>`) %>%
ungroup()
# ##JOIN CURBS
# # Calculate intersections
# intersections <- st_intersection(vars_net, curblines)
#
# # Count intersections by fishnet uniqueID
# intersection_counts <- intersections %>%
# group_by(uniqueID) %>%
# summarize(count = n()) %>%
# ungroup()
#
# # Join counts back to the original fishnet (if needed)
# vars_net <- st_join(vars_net, intersection_counts, by = "uniqueID") %>%
# mutate(curbcount = ifelse(is.na(count), 0, count)) %>%
# dplyr::select(-count, -uniqueID.x, -uniqueID.y) %>%
# mutate(uniqueID = row_number())
# ##JOIN HISTORIC STREAMS
# # Calculate intersections
# intersections <- st_intersection(vars_net, HistoricStreams)
#
# # Count intersections by fishnet uniqueID
# intersection_counts <- intersections %>%
# group_by(uniqueID) %>%
# summarize(count = n()) %>%
# ungroup()
#
# # Join counts back to the original fishnet (if needed)
# vars_net <- st_join(vars_net, intersection_counts, by = "uniqueID") %>%
# mutate(H_Streams = ifelse(is.na(count), 0, count)) %>%
# dplyr::select(-count, -uniqueID.x, -uniqueID.y) %>%
# mutate(uniqueID = row_number())
Joining areal data
final_net <- st_centroid(vars_net) %>%
st_join(dplyr::select(tracts2018, MedHHInc), by = "uniqueID") %>%
st_join(dplyr::select(tracts2018, PopDensity), by = "uniqueID") %>%
st_join(dplyr::select(tracts2018, HomeOwnRate), by = "uniqueID") %>%
st_join(dplyr::select(neighborhoods, NAME), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(vars_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
final_net <- final_net %>%
rename(neighborhoods = NAME)
# for live demo
# mapview::mapview(final_net, zcol = "District")
final_net.long <- final_net %>%
dplyr::select(-neighborhoods, -cvID)
final_net.long <- gather(final_net.long, Variable, value, -geometry, -uniqueID, -countRequests)
vars <- unique(final_net.long$Variable)
vars <- unique(final_net.long$Variable)
mapList <- list()
for(i in seq_along(vars)){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.long, Variable == vars[i]), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=vars[i]) +
mapTheme()
}
do.call(grid.arrange,c(mapList, ncol=3, top="Influencing Factors by Fishnet"))
This map reveals some of these features have similar spatial
distribution to street tree 311 calls.
Calculate Local Moran’s I
## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods...
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
# print(final_net.weights, zero.policy=TRUE)
local_morans <- localmoran(final_net$countRequests, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(TreeCount = PhillyTrees,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -neighborhoods) %>%
gather(Variable, Value, -countRequests)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countRequests, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countRequests)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Street Tree 311 Calls as a function of risk factors") +
plotTheme()
These scatterplots suggest a number of things about the
relationship between street tree 311 calls and our factors of interest.
We see some strong positive correlations between popDensity,
PhillyTrees, SewerInlets, and sum_TREE_DBH, suggesting that the people,
more trees, stormwater infrastructure, and the bigger the trees of a
given area, the more likely that area is to place street tree 311
calls.
ggplot(final_net, aes(x=countRequests)) +
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.7) + # You can adjust binwidth as needed
labs(title="Histogram of Street Tree 311 Calls",
x="Number of Calls",
y="Frequency") +
theme_minimal()
This histogram shows that across the city, most fishnet cells do
not have any calls, and there is a steady dropoff from cells with lower
numbers of calls to those with higher numbers of calls. Notable here is
that at the high end, there are increases in the frequency of calls,
which suggests some areas have people who are ‘frequent callers’ or
trees who are ‘frequent fallers.’ Perhaps we should remove them?
Choose Variables
# Version 1
# reg.vars <- c("trees.nn", "sum_TREE_DBH", "mean_TREE_DBH", "PhillyTrees",
# "SewerInlets", "VacantLots", "curbcount",
# "H_Streams", "MedHHInc", "PopDensity")
#
# reg.ss.vars <- c("trees.nn", "sum_TREE_DBH", "mean_TREE_DBH", "PhillyTrees",
# "SewerInlets", "VacantLots", "curbcount",
# "H_Streams", "MedHHInc", "PopDensity", "treecall.isSig", "treecall.isSig.dist")
# Version 2
# reg.vars <- c("trees.nn", "sum_TREE_DBH", "PhillyTrees",
# "SewerInlets", "VacantLots",
# "H_Streams", "MedHHInc", "PopDensity")
#
# reg.ss.vars <- c("trees.nn", "sum_TREE_DBH", "PhillyTrees",
# "SewerInlets", "VacantLots",
# "H_Streams", "MedHHInc", "PopDensity", "treecall.isSig", "treecall.isSig.dist")
# Version 3
# reg.vars <- c("PhillyTrees", "sum_TREE_DBH", "SewerInlets", "PopDensity", "HomeOwnRate")
#
# reg.ss.vars <- c("PhillyTrees", "sum_TREE_DBH", "SewerInlets", "PopDensity", "HomeOwnRate",
# "treecall.isSig", "treecall.isSig.dist")
# Version 4
reg.vars <- c("sum_TREE_DBH", "SewerInlets", "PopDensity")
reg.ss.vars <- c("sum_TREE_DBH", "SewerInlets", "PopDensity",
"treecall.isSig", "treecall.isSig.dist")
# Version 5
# reg.vars <- c("PhillyTrees", "PopDensity")
#
# reg.ss.vars <- c("PhillyTrees", "SewerInlets", "PopDensity",
# "treecall.isSig", "treecall.isSig.dist")
‘Leave One Group Out’ Cross-Validation on spatial features
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countRequests",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countRequests, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countRequests",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countRequests, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "neighborhoods",
dependentVariable = "countRequests",
indVariables = reg.vars) %>%
dplyr::select(cvID = neighborhoods, countRequests, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "neighborhoods",
dependentVariable = "countRequests",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = neighborhoods, countRequests, Prediction, geometry)
Cross-validation is a technique used to assess how well a model
will generalize to an independent dataset. It involves partitioning the
original dataset into a training set to train the model and a test set
to evaluate it.
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - countRequests,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - countRequests,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV, Error = Prediction - countRequests,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - countRequests,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countRequests, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(0, color = "black", background = "white") %>%
row_spec(1, color = "black", background = "grey") %>%
row_spec(2, color = "black", background = "#FDE725FF") %>%
row_spec(3, color = "black", background = "grey") %>%
row_spec(4, color = "black", background = "#FDE725FF")
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 0.33 | 0.28 |
| Random k-fold CV: Spatial Process | 0.32 | 0.27 |
| Spatial LOGO-CV: Just Risk Factors | 0.78 | 0.90 |
| Spatial LOGO-CV: Spatial Process | 0.76 | 0.90 |
These are some pretty low MAE values. How do they break down across differing demographics?
tracts18 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2018, state=42, county="101", geometry=T) %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
st_transform(crs = 2272) %>%
.[neighborhoods,]
# Create the table
error_by_reg_and_race <-
reg.summary %>%
st_centroid() %>%
st_join(tracts18) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error)
# Display the table
error_by_reg_and_race %>%
kable(caption = "Mean Error by Regression Type and Neighborhood Racial Context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(0, color = "black", background = "white") %>%
row_spec(1, color = "black", background = "grey") %>%
row_spec(2, color = "black", background = "#FDE725FF") %>%
row_spec(3, color = "black", background = "grey") %>%
row_spec(4, color = "black", background = "#FDE725FF")
This table tells us a few things: first, all four of our models are
worse at predicting the likelihood of a street tree-related 311 call in
majority non-white neighborhoods than majority white neighborhoods, both
for the risk factors alone and when considering the spatial processes.
Second, the model underestimates the number of street tree 311 calls for
majority non-white neighborhoods and slightly overestimates the number
of calls for majority white neighborhoods. Lastly, the inclusion of
spatial processes marginally improves the accuracy of the model in both
majority white and non-white neighborhoods.
We must keep in mind,
however, that these errors are means of all the errors across the city,
so it may be the case that the positive and negative errors (over- and
under-estimation of street tree 311 calls) are cancelling each other out
to some degree. Let’s make a map to investigate this.
# Create the small multiple map with the midpoint set at 0
ggplot(data = reg.summary) +
geom_sf(aes(fill = Error), color = NA) +
facet_wrap(~ Regression, ncol = 2) +
theme_minimal() +
labs(title = "Model Errors by Random K-Fold and Spatial Cross Validation",
fill = "Error") +
scale_fill_gradient2(low = palette_diverging[1],
mid = palette_diverging[2],
high = palette_diverging[3],
midpoint = 0) +
theme(legend.position = "bottom")
In this multiple map, positive errors (over-predictions) are shown as
red and negative errors (under-predictions) are shown as blue.
The
highest numbers of over-predictions appear to occur downtown, whereas
under-predictions are more evenly distributed across the city.
Since there are lots of errors on either side of zero, our low mean
error is probably not the best measure of model accuracy.
Get 2019 Street Tree 311 call data
StreetTreeRequests19 <- st_read("https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/StreetTreeRequests2019.geojson") %>%
.[fishnet,]
Density vs predictions
TreeCall_ppp <- as.ppp(st_coordinates(StreetTreeRequests19), W = st_bbox(final_net))
TreeCall.1000 <- density.ppp(TreeCall_ppp, resolution)
tree_KDE_sum <- as.data.frame(TreeCall.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean)
kde_breaks <- classIntervals(tree_KDE_sum$value,
n = 5, "fisher")
tree_KDE_sf <- tree_KDE_sum %>%
mutate(label = "Kernel Density",
Risk_Category = classInt::findCols(kde_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(StreetTreeRequests19) %>% mutate(TreeCallCount = 1), ., sum) %>%
mutate(TreeCallCount = replace_na(TreeCallCount, 0))) %>%
dplyr::select(label, Risk_Category, TreeCallCount)
# reg.cv
# reg.ss.cv
# reg.spatialCV
# reg.ss.spatialCV
ml_breaks <- classIntervals(reg.ss.cv$Prediction,
n = 5, "fisher")
tree_risk_sf <- reg.ss.cv %>%
mutate(label = "Risk Predictions",
Risk_Category =classInt::findCols(ml_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(StreetTreeRequests19) %>% mutate(TreeCallCount = 1), ., sum) %>%
mutate(TreeCallCount = replace_na(TreeCallCount, 0))) %>%
dplyr::select(label,Risk_Category, TreeCallCount)
rbind(tree_KDE_sf, tree_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
# geom_sf(data = sample_n(StreetTreeRequests19, 3000), size = .1, colour = "black", alpha = .1) +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2019 Street Tree 311 Calls") +
mapTheme(title_size = 14)
These two maps indicate that our predicted calls are more spread out
across the city than the actual density for 2019. Higher risk areas
appear to be under-predicted and lower risk areas appear to be
over-predicted.
rbind(tree_KDE_sf, tree_risk_sf) %>%
st_drop_geometry() %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countTreeCalls = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Pcnt_of_test_set_calls = countTreeCalls / sum(countTreeCalls)) %>%
ggplot(aes(Risk_Category,Pcnt_of_test_set_calls)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE, name = "Model") +
labs(title = "Call prediction vs. Kernel density, 2019 Street Tree 311 Calls",
y = "% of Test Set Calls (per model)",
x = "Risk Category") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
This bar chart confirms this suspicion.
I would not recommend this algorithm be used to predict where
Street Tree 311 calls will be placed. The algorithm, when trained on
2018 data, failed to predict calls for 2019 data with acceptable
accuracy. Though the average percent error was small, the map and bar
chart above indicate that the algorithm over-estimates risk in the lower
risk areas and underestimates risk in higher risk areas. This is the
opposite of what one would want from such an algorithm.
Further undermining the effectiveness of this model is the fact that it
underestimates the number of street tree 311 calls for majority
non-white neighborhoods and overestimates the number of calls for
majority white neighborhoods. In the interest of making preventative
tree care more equitable, one would want to refine the model to produce
predictions that are more evenly accurate across demographics. This
could be done through investigation of a wider range of independent
variables to improve prediction, further refinement of the feature
engineering process for these variables, setting a limit to the number
of calls per cell so certain areas with the ‘frequent caller/frequent
faller’ phenomenon skew the training of the algorithm less, among other
strategies.
Street trees are a vital part of Philadelphia’s
resilience infrastructure. A tool such as this, once properly refined,
could save the city millions of dollars per year in emergency
maintenance costs by better enabling preventative care. Carpe
tree-em!